Goto

Collaborating Authors

 tree-based method


Efficient Subgroup Analysis via Optimal Trees with Global Parameter Fusion

Xie, Zhongming, Giorgio, Joseph, Wang, Jingshen

arXiv.org Machine Learning

Identifying and making statistical inferences on differential treatment effects (commonly known as subgroup analysis in clinical research) is central to precision health. Subgroup analysis allows practitioners to pinpoint populations for whom a treatment is especially beneficial or protective, thereby advancing targeted interventions. Tree based recursive partitioning methods are widely used for subgroup analysis due to their interpretability. Nevertheless, these approaches encounter significant limitations, including suboptimal partitions induced by greedy heuristics and overfitting from locally estimated splits, especially under limited sample sizes. To address these limitations, we propose a fused optimal causal tree method that leverages mixed integer optimization (MIO) to facilitate precise subgroup identification. Our approach ensures globally optimal partitions and introduces a parameter fusion constraint to facilitate information sharing across related subgroups. This design substantially improves subgroup discovery accuracy and enhances statistical efficiency. We provide theoretical guarantees by rigorously establishing out of sample risk bounds and comparing them with those of classical tree based methods. Empirically, our method consistently outperforms popular baselines in simulations. Finally, we demonstrate its practical utility through a case study on the Health and Aging Brain Study Health Disparities (HABS-HD) dataset, where our approach yields clinically meaningful insights.


Learning based on neurovectors for tabular data: a new neural network approach

Husillos, J. C., Gallego, A., Roma, A., Troncoso, A.

arXiv.org Artificial Intelligence

--In this paper, we present a novel learning approach based on Neurovectors, an innovative paradigm that structures information through interconnected nodes and vector relationships for tabular data processing. Unlike traditional artificial neural networks that rely on weight adjustment through backpropagation, Neurovectors encode information by structuring data in vector spaces where energy propagation, rather than traditional weight updates, drives the learning process, enabling a more adaptable and explainable learning process. Our method generates dynamic representations of knowledge through neurovectors, thereby improving both the interpretability and efficiency of the predictive model. Experimental results using datasets from well-established repositories such as the UCI machine learning repository and Kaggle are reported both for classification and regression. T o evaluate its performance, we compare our approach with standard machine learning and deep learning models, showing that Neurovectors achieve competitive accuracy while significantly reducing computational costs. Machine learning has significantly advanced in recent decades, with models such as deep neural networks (DNNs), decision tree-based methods, and probabilistic models [1], [2], among others. Despite their widespread success, these approaches present limitations in terms of preprocessing requirements, interpretability, and computational efficiency. Deep neural networks have demonstrated remarkable performance in various domains--from computer vision to natural language processing [3].


RO-FIGS: Efficient and Expressive Tree-Based Ensembles for Tabular Data

Matjašec, Urška, Simidjievski, Nikola, Jamnik, Mateja

arXiv.org Artificial Intelligence

Tree-based models are often robust to uninformative features and can accurately capture non-smooth, complex decision boundaries. Consequently, they often outperform neural network-based models on tabular datasets at a significantly lower computational cost. Nevertheless, the capability of traditional tree-based ensembles to express complex relationships efficiently is limited by using a single feature to make splits. To improve the efficiency and expressiveness of tree-based methods, we propose Random Oblique Fast Interpretable Greedy-Tree Sums (RO-FIGS). RO-FIGS builds on Fast Interpretable Greedy-Tree Sums, and extends it by learning trees with oblique or multivariate splits, where each split consists of a linear combination learnt from random subsets of features. This helps uncover interactions between features and improves performance. The proposed method is suitable for tabular datasets with both numerical and categorical features. We evaluate RO-FIGS on 22 real-world tabular datasets, demonstrating superior performance and much smaller models over other tree- and neural network-based methods. Additionally, we analyse their splits to reveal valuable insights into feature interactions, enriching the information learnt from SHAP summary plots, and thereby demonstrating the enhanced interpretability of RO-FIGS models. The proposed method is well-suited for applications, where balance between accuracy and interpretability is essential.


Exploring space efficiency in a tree-based linear model for extreme multi-label classification

Lin, He-Zhe, Liu, Cheng-Hung, Lin, Chih-Jen

arXiv.org Artificial Intelligence

Extreme multi-label classification (XMC) aims to identify relevant subsets from numerous labels. Among the various approaches for XMC, tree-based linear models are effective due to their superior efficiency and simplicity. However, the space complexity of tree-based methods is not well-studied. Many past works assume that storing the model is not affordable and apply techniques such as pruning to save space, which may lead to performance loss. In this work, we conduct both theoretical and empirical analyses on the space to store a tree model under the assumption of sparse data, a condition frequently met in text data. We found that, some features may be unused when training binary classifiers in a tree method, resulting in zero values in the weight vectors. Hence, storing only non-zero elements can greatly save space. Our experimental results indicate that tree models can achieve up to a 95% reduction in storage space compared to the standard one-vs-rest method for multi-label text classification. Our research provides a simple procedure to estimate the size of a tree model before training any classifier in the tree nodes. Then, if the model size is already acceptable, this approach can help avoid modifying the model through weight pruning or other techniques.


Reviews: REFUEL: Exploring Sparse Features in Deep Reinforcement Learning for Fast Disease Diagnosis

Neural Information Processing Systems

The authors describe an RL architecture comprised of reward shaping plus representation learning that is used to solve an active classification problem, framed as "diagnosis." In this setting, an agent can measure the value of "symptoms" at some cost, and eventually makes a prediction of what disease is present. The architecture is intended to take advantage of the property that symptoms are sparse but correlated. Reward shaping is used to help the agent learn to quickly find symptoms that are present, while the correlations are used to avoid having the agent measure symptoms that are already "known" based on already-measured ones with high certainty. Experimental results demonstrate a substantial improvement over prior work.


Ranking Perspective for Tree-based Methods with Applications to Symbolic Feature Selection

Luo, Hengrui, Li, Meng

arXiv.org Machine Learning

Tree-based methods are powerful nonparametric techniques in statistics and machine learning. However, their effectiveness, particularly in finite-sample settings, is not fully understood. Recent applications have revealed their surprising ability to distinguish transformations (which we call symbolic feature selection) that remain obscure under current theoretical understanding. This work provides a finite-sample analysis of tree-based methods from a ranking perspective. We link oracle partitions in tree methods to response rankings at local splits, offering new insights into their finite-sample behavior in regression and feature selection tasks. Building on this local ranking perspective, we extend our analysis in two ways: (i) We examine the global ranking performance of individual trees and ensembles, including Classification and Regression Trees (CART) and Bayesian Additive Regression Trees (BART), providing finite-sample oracle bounds, ranking consistency, and posterior contraction results. (ii) Inspired by the ranking perspective, we propose concordant divergence statistics $\mathcal{T}_0$ to evaluate symbolic feature mappings and establish their properties. Numerical experiments demonstrate the competitive performance of these statistics in symbolic feature selection tasks compared to existing methods.


Regression Trees Know Calculus

Wycoff, Nathan

arXiv.org Machine Learning

Regression trees have emerged as a preeminent tool for solving real-world regression problems due to their ability to deal with nonlinearities, interaction effects and sharp discontinuities. In this article, we rather study regression trees applied to well-behaved, differentiable functions, and determine the relationship between node parameters and the local gradient of the function being approximated. We find a simple estimate of the gradient which can be efficiently computed using quantities exposed by popular tree learning libraries. This allows the tools developed in the context of differentiable algorithms, like neural nets and Gaussian processes, to be deployed to tree-based models. To demonstrate this, we study measures of model sensitivity defined in terms of integrals of gradients and demonstrate how to compute them for regression trees using the proposed gradient estimates. Quantitative and qualitative numerical experiments reveal the capability of gradients estimated by regression trees to improve predictive analysis, solve tasks in uncertainty quantification, and provide interpretation of model behavior.


Adapting tree-based multiple imputation methods for multi-level data? A simulation study

Gurtskaia, Ketevan, Schwerter, Jakob, Doebler, Philipp

arXiv.org Machine Learning

This simulation study evaluates the effectiveness of multiple imputation (MI) techniques for multilevel data. It compares the performance of traditional Multiple Imputation by Chained Equations (MICE) with tree-based methods such as Chained Random Forests with Predictive Mean Matching and Extreme Gradient Boosting. Adapted versions that include dummy variables for cluster membership are also included for the tree-based methods. Methods are evaluated for coefficient estimation bias, statistical power, and type I error rates on simulated hierarchical data with different cluster sizes (25 and 50) and levels of missingness (10\% and 50\%). Coefficients are estimated using random intercept and random slope models. The results show that while MICE is preferred for accurate rejection rates, Extreme Gradient Boosting is advantageous for reducing bias. Furthermore, the study finds that bias levels are similar across different cluster sizes, but rejection rates tend to be less favorable with fewer clusters (lower power, higher type I error). In addition, the inclusion of cluster dummies in tree-based methods improves estimation for Level 1 variables, but is less effective for Level 2 variables. When data become too complex and MICE is too slow, extreme gradient boosting is a good alternative for hierarchical data. Keywords: Multiple imputation; multi-level data; MICE; missRanger; mixgb


Evaluating tree-based imputation methods as an alternative to MICE PMM for drawing inference in empirical studies

Schwerter, Jakob, Gurtskaia, Ketevan, Romero, Andrés, Zeyer-Gliozzo, Birgit, Pauly, Markus

arXiv.org Machine Learning

Dealing with missing data is an important problem in statistical analysis that is often addressed with imputation procedures. The performance and validity of such methods are of great importance for their application in empirical studies. While the prevailing method of Multiple Imputation by Chained Equations (MICE) with Predictive Mean Matching (PMM) is considered standard in the social science literature, the increase in complex datasets may require more advanced approaches based on machine learning. In particular, tree-based imputation methods have emerged as very competitive approaches. However, the performance and validity are not completely understood, particularly compared to the standard MICE PMM. This is especially true for inference in linear models. In this study, we investigate the impact of various imputation methods on coefficient estimation, Type I error, and power, to gain insights that can help empirical researchers deal with missingness more effectively. We explore MICE PMM alongside different tree-based methods, such as MICE with Random Forest (RF), Chained Random Forests with and without PMM (missRanger), and Extreme Gradient Boosting (MIXGBoost), conducting a realistic simulation study using the German National Educational Panel Study (NEPS) as the original data source. Our results reveal that Random Forest-based imputations, especially MICE RF and missRanger with PMM, consistently perform better in most scenarios. Standard MICE PMM shows partially increased bias and overly conservative test decisions, particularly with non-true zero coefficients. Our results thus underscore the potential advantages of tree-based imputation methods, albeit with a caveat that all methods perform worse with an increased missingness, particularly missRanger.


NSOTree: Neural Survival Oblique Tree

Sun, Xiaotong, Qiu, Peijie

arXiv.org Machine Learning

Survival analysis is a statistical method employed to scrutinize the duration until a specific event of interest transpires, known as time-to-event information characterized by censorship. Recently, deep learning-based methods have dominated this field due to their representational capacity and state-of-the-art performance. However, the black-box nature of the deep neural network hinders its interpretability, which is desired in real-world survival applications but has been largely neglected by previous works. In contrast, conventional tree-based methods are advantageous with respect to interpretability, while consistently grappling with an inability to approximate the global optima due to greedy expansion. In this paper, we leverage the strengths of both neural networks and tree-based methods, capitalizing on their ability to approximate intricate functions while maintaining interpretability. To this end, we propose a Neural Survival Oblique Tree (NSOTree) for survival analysis. Specifically, the NSOTree was derived from the ReLU network and can be easily incorporated into existing survival models in a plug-and-play fashion. Evaluations on both simulated and real survival datasets demonstrated the effectiveness of the proposed method in terms of performance and interpretability.